A statistical framework for differential pseudotime analysis with multiple single-cell RNA-seq samples

Pseudotime analysis with single-cell RNA-sequencing (scRNA-seq) data has been widely used to study dynamic gene regulatory programs along continuous biological processes. While many methods have been developed to infer the pseudotemporal trajectories of cells within a biological sample, it remains a challenge to compare pseudotemporal patterns with multiple samples (or replicates) across different experimental conditions. Here, we introduce Lamian, a comprehensive and statistically-rigorous computational framework for differential multi-sample pseudotime analysis. Lamian can be used to identify changes in a biological process associated with sample covariates, such as different biological conditions while adjusting for batch effects, and to detect changes in gene expression, cell density, and topology of a pseudotemporal trajectory. Unlike existing methods that ignore sample variability, Lamian draws statistical inference after accounting for cross-sample variability and hence substantially reduces sample-specific false discoveries that are not generalizable to new samples. Using both real scRNA-seq and simulation data, including an analysis of differential immune response programs between COVID-19 patients with different disease severity levels, we demonstrate the advantages of Lamian in decoding cellular gene expression programs in continuous biological processes.


Supplementary Figures
Figure S1.The statistical model is named as Lamian, a traditional Chinese hand-pulled noodle, based on the similarity between the model and the process of making Lamian.To make the Lamian noodles, one first makes a firm dough using wheat flour, and then pulls many thinner noodles from the dough.Finally, one coats the noodles with extra flour to prevent the noodles from sticking to each other.In the method Lamian, the population-level pattern (the thick orange curve) of gene expression dynamics in all samples resembles the dough.The sample-level gene expression dynamics are modeled as the population-level pattern plus some sample-level random variation, resulting in a curve for each sample (the thin curves).The sample-level curves resemble the thinner noodles coming from the dough.Finally, the observed cell-level gene expression values are modeled as the sample-level curve plus some cell-level random variation.Here the cells resemble the flour.Rows are genes identified using FDR < 0.05.Only XDE genes with significant mean shift, trend difference, or both are shown.In the erythroid lineage, 53% (or n=17) of the XDE genes identified by Lamian are on sex chromosomes (one-sided permutation test p-values for the enrichment of sex chromosome genes: 0.017 for chrX and 0.000 for chrY).In the lymphoid lineage, 62% (or n=18) of the XDE genes identified by Lamian are on sex chromosomes (one-sided permutation test p-values for the enrichment of sex chromosome genes: 0.001 for chrX and 0.000 for chrY).Source data are provided as a Source Data file.The cell cycle gene lists of S-phase (43 genes) and G2-M-phase (54 genes) were retrieved from Seurat v.3.2.1.The gene list of G1-phase was retrieved from GSEA G1_PHASE (15 genes).For each cell, S-, G2-M-and G1-phase scores were calculated using the average of the normalized log2-transformed gene expression across genes in the corresponding gene lists.In this way, with the cellular pseudotime, the pseudotemporal pattern of the cell cycle was obtained.For each sample, Pearson correlation coefficients between cell cycle scores and cell density along pseudotime were calculated.The boxplots show the distributions (centre: median; bounds of box: 1 st and 3 rd quartiles; bounds of whiskers: data points within 1.5 IQR from the box; minima; maxima) of correlation coefficients for different datasets and tree branches (each data point represents a sample).The sample sizes are n = 8 for each box in HCA, n = 114 for each box in COVID19, and n = 184 for each box in TB data.Source data are provided as a Source Data file.remove cell proportion significance percentage          Pseudotemporal gene expression patterns were fitted as in TDE analysis.The correlation coefficient was computed for each gene using its fitted group-level curves.(e) Similar to (d) but the pseudotemporal curves are fitted as in XDE analysis.In other words, the correlation was analyzed within each sample group (female or male) separately.(f) The proportion of Seurat+Lamian-reported TDE genes (FDR<0.05) that can be found in the top 100n Harmony+Lamian TDE genes (n = 1, 2, . . . ) ("test results").As a negative control, the gene list based on Harmony was also permuted to re-calculate the overlap ("permuted").The vertical bars with "*" denote the FDR<0.05cutoffs for Harmony+Lamian in the three lineages.(g) Similar to (f) but for XDE analysis.Source data are provided as a Source Data file.Pseudotemporal gene expression patterns were fitted as in TDE analysis.The correlation coefficient was computed for each gene using its fitted group-level curves.(e) Similar to (d) but the pseudotemporal curves are fitted as in XDE analysis.In other words, the correlation was analyzed within each sample group (female or male) separately.(f) The proportion of Seurat+Lamian-reported TDE genes (FDR<0.05) that can be found in the top 100n scVI+Lamian TDE genes (n = 1, 2, . . . ) ("test results").As a negative control, the gene list based on scVI was also permuted to re-calculate the overlap ("permuted").The vertical bars with "*" denote the FDR<0.05cutoffs for scVI+Lamian in the three lineages.(g) Similar to (f) but for XDE analysis.Source data are provided as a Source Data file.

E-step
In iteration t + 1, the E-step computes the Q-function Q(Θ|Θ (t) ) , which is the expectation of the complete data log-likelihood l(Θ) with respect to the conditional distribution of missing data u and σ 2 given the observed data y and old parameter values Thus, Here the inverse-Gamma distribution for σ 2 s |y s , Θ is derived based on P(σ 2 s |y s , Θ) = P(u s , σ 2 s |y s , Θ)du s ∝(σ 2 s ) −(C s +K+1)/2−α−1 exp − Based on the properties of inverse-Gamma distribution, we have By maximizing the Q-function with respect to the unknown parameters Θ, we obtain the new parameter estimates.
• η (t+1) : it is the solution to where ψ(•) is the digamma function.This can be solved using bound constrained optimization 3 .
• α (t+1) : • β (t+1) : • Ω (t+1) : (Note the connection between the integrand and the probability density function of multivariate t-distribution) Thus, the log of the observed data likelihood used to construct the likelihood ratio test statistic is where the parameters Θ are set to be the maximum likelihood estimates (MLE) obtained from the EM algorithm (i.e., Θ (t) from the last EM iteration).

Additional notes for filtering cells in HCA-BM data
For this dataset, we retained cells with ≥ 1000 expressed genes, ≤ 10% mitochondrial reads, and ≥ 5000 reads (Fig. S21a-c).The 10% mitochondrial reads cutoff is not a strong filter and retains most of the cells.The number of expressed genes cutoff was set as 1000, which is a strong filter and removes the majority of cells.However, applying less stringent cutoffs resulted in noisy low dimensional representation of cells that did not clearly capture the known biology.For example, setting the number of expressed gene cutoff to 500 allowed us to retain 255,547 cells.However, the three cell differentiation lineages were unclear in both Seurat-and Harmony-integrated results.Moreover, both methods were not able to identify any cluster annotated as hematopoietic stem cell (HSC), and a number of clusters were annotated as NA (unidentifiable cell types, colored by grey) (Fig. S21 e,f).These indicate that for this well-studied system, setting the cutoff to be 500 was not a good choice.Similarly, we also tried to set the cutoff to be the median number of expressed genes across cells.The results were also noisy and did not clearly capture the three lineages (Fig. S21 g,h).Since HCA-BM data serves as an example dataset for method illustration, we decided to apply the stringent cutoff 1000 so that we can capture the known biology and hence use the known biology to benchmark downstream differential expression analysis.For the read count cutoff, the number of reads and the number of expressed genes are highly correlated (Fig. S21d).We choose 5000 as the reads cutoff simply because most cells with ≥1000 expressed genes also have ≥5000 reads.In other words, 5000 reads cutoff is not a strong filter on top of the number of expressed gene filter and does not influence the filtering results much.

Figure S2 .Figure S3 .Figure S4 .
Figure S2.Marker gene expression levels in the three main lineages of hematopoietic stem cell differentiation in HCA-BM data.Source data are provided as a Source Data file.

Figure S5 .
Figure S5.Heatmaps of XDE genes along the (a) erythroid and (b) lymphoid lineages associated with sex (male, female) as a covariate found by Lamian's XDE test (Module 3) in the HCA-BM data.Rows are genes identified using FDR < 0.05.Only XDE genes with significant mean shift, trend difference, or both are shown.In the erythroid lineage, 53% (or n=17) of the XDE genes identified by Lamian are on sex chromosomes (one-sided permutation test p-values for the enrichment of sex chromosome genes: 0.017 for chrX and 0.000 for chrY).In the lymphoid lineage, 62% (or n=18) of the XDE genes identified by Lamian are on sex chromosomes (one-sided permutation test p-values for the enrichment of sex chromosome genes: 0.001 for chrX and 0.000 for chrY).Source data are provided as a Source Data file.

× p < 2 .Figure S7 .
Figure S6.TCD and XCD test (Module 4) results for the HCA-BM data on (a,b) erythroid and (c,d) lymphoid lineages.The pseudotime was split into 100 equally-spaced bins.Each dot represents the cell density in each bin of one sample (i.e., the number of cells in that bin divided by the total number of cells in that sample).Each curve represents the model-fitted temporal pattern of a sample.In TCD test (n = 8, LLR = 113.12(a) and 196.31 (c)), the curves are colored by individual samples.In XCD test (n = 8, LLR = 2.46 (b) and 6.52 (d)), they are colored by the sample group.One-sided p-values are shown.Source data are provided as a Source Data file.

Figure S8 .
Figure S8.Comparison of k-means and Gaussian mixture model clustering results of differential genes.(a)-(d) Clustering XDE genes in the in silico spike-in datasets used in Fig. 4b-d.Heatmaps show the confusion matrix of k-means and Gaussian mixture model (GMM) clustering results for data with increasing signal strength ((a) to (d) represents signal strength from 1 to 4, respectively).(e)-(h) Clustering TDE genes in the in silico spike-in datasets used in Fig. S14.Similar to (a)-(d) but on TDE genes.Overall, the clustering results from the two clustering methods become more similar as the signal strength increases.expr means expression.Source data are provided as a Source Data file.

Figure S9 .
Figure S9.Comparison of k-means and Louvain clustering results of differential genes.(a)-(d) Clustering XDE genes in the in silico spike-in datasets used in Fig. 4b-d.Heatmaps show the confusion matrix of k-means and Louvain clustering results for data with increasing signal strength ((a) to (d) represents signal strength from 1 to 4, respectively).(e)-(h) Clustering TDE genes in the in silico spike-in datasets used in Fig. S14.Similar to (a)-(d) but on TDE genes.Overall, the clustering results from the two clustering methods become more similar as the signal strength increases.expr means expression.Source data are provided as a Source Data file.

Figure S11 .Figure S12 .Figure S14 .
Figure S11.Example XDE genes in HCA-BM data's myeloid lineage.Each column represents a comparison between Lamian and another method.Within the column, the top row shows an example gene detected by the other method but not Lamian, and the bottom row shows an example gene detected by Lamian but not the other method.N/A means no genes are available in the corresponding category.Each plot shows fitted curves for individual samples color-coded by sample group.Source data are provided as a Source Data file.
in all samples in the first half pseudotime bins

Figure S15 .
Figure S15.Simulation study for evaluating differential cell density tests.(a)-(d) TCD test simulation.(a) Null simulation setting of TCD test.Cell proportion in each of the 100 bins along pseudotime (denoted as dots) and the model-fitted curve for each HCA-BM sample are shown.The pseudotime of the cells have been permuted so that there is no time-dependent pattern.(b) Similar to (a) except that 50% of the cells in the first half pseudotime bins in each of the samples have been removed.(c) A violin plot showing the distribution of TCD test p-values (dots, one-sided) obtained from 1,000 simulations where a certain proportion (x-axis) of the cells in the first half pseudotime bins in each sample have been removed.(d) A bar plot showing the percentage of simulations (out of 1,000) that have reported a TCD test p-value < 0.05 (one-sided).(e-h) Similar to (a-d) except that it is in the XCD test simulation.Here, instead of removing cells in all samples, we only remove cells in the group 1 samples (BM1,2,5,6); and we do not permute the pseudotime.Source data are provided as a Source Data file.

Figure S16 .Figure S18 .Figure S19 .
Figure S16.TDE and XDE genes and their enriched functions in TB data.(a) A heatmap showing the dynamic expression patterns of TDE genes along pseudotime.Each row represents a TDE gene.(b) Enriched GO terms of the TDE genes in each cluster shown in (a).(c)The model-fitted patterns of each XDE gene cluster are shown in the two columns on the right.Within each row, two clusters with the same trend difference (one with mean shift and one without) are labeled as a and b respectively.For example, Clusters 1a and 1b contain genes with the same trend difference pattern, but genes in 1b also have significant mean shift (i.e.meanSig).For each trend difference pattern, the difference between the male and female groups is shown in the column on the left.(d) Enriched GO terms of the XDE genes in each cluster shown in (c).GO analyses were conducted similarly to Fig.S4.Source data are provided as a Source Data file.

Figure S20 .
Figure S20.Lamian analysis of the HCA-BM data based on scVI integration and comparison with Seurat integration.(a-b) Cells plotted with top two latent variables from scVI, colored by clusters (a) or pseudotime (b).Pseudotime was inferred using the top 10 latent variables output by scVI.(c) Cell proportion on each lineage.The barplot shows the mean (pink bar) and mean ± SD (blue bars) of the branch cell proportion across n = 8 samples for each lineage.Cell proportions are displayed as dots and heatmap.(d) Distribution of the Pearson correlation coefficients between genes' pseudotemporal expression patterns fitted by Seurat+Lamian and those fitted by scVI+Lamian.Pseudotemporal gene expression patterns were fitted as in TDE analysis.The correlation coefficient was computed for each gene using its fitted group-level curves.(e) Similar to (d) but the pseudotemporal curves are fitted as in XDE analysis.In other words, the correlation was analyzed within each sample group (female or male) separately.(f) The proportion of Seurat+Lamian-reported TDE genes (FDR<0.05) that can be found in the top 100n scVI+Lamian TDE genes (n = 1, 2, . . . ) ("test results").As a negative control, the gene list based on scVI was also permuted to re-calculate the overlap ("permuted").The vertical bars with "*" denote the FDR<0.05cutoffs for scVI+Lamian in the three lineages.(g) Similar to (f) but for XDE analysis.Source data are provided as a Source Data file.

Figure S21 .
Figure S21.Quality control (QC) plots and the integration results from Harmony and Seurat with relaxed quality filters.(a)-(d) Current filtering criteria.(e)-(f) Integration results from Harmony (e) and Seurat RPCA (f) after relaxing the number of expressed genes per cell cutoff from 1000 to 500.In Harmony (e), 6/27 clusters are unidentifiable cell types (NA) which occupy 10831 out of 255547 cells.In Seurat (f), 7/28 clusters are unidentifiable cell types (NA) which occupy 11409 out of 255547 cells.(g)-(h) Integration results from Harmony (g) and Seurat RPCA (h) after changing the number of expressed genes cutoff from 1000 to the median number of expressed genes.In Harmony (g), 9 out of 29 clusters are NA, i.e. 11995 out of 164917 cells.In Seurat (h), 8 out of 27 clusters are NA, i.e. 9673 out of 164917 cells.Source data are provided as a Source Data file.